The objective of this task is to build a machine learning model that can predict the rent per square meter in the city of Hamburg.
Additional geo-data features have to be fetched from external sources and merged with the given dataset. These external features can be anything useful for the analysis, such as nearby places, bus stops, or railway stations.
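One way to obtain such external features is the OpenStreetMap Overpass API. The sketch below is illustrative, not part of the given task code: the function names, the 500 m default radius, and the public Overpass endpoint are assumptions. Counting, say, bus stops around each listing's coordinates would yield one candidate feature per listing.

```python
# Illustrative sketch: derive an external geo-feature by counting OSM objects
# near a listing via the public Overpass API. highway=bus_stop and
# railway=station are standard OSM tags.

def build_overpass_query(lat, lon, radius_m, key, value):
    """Build an Overpass QL query for OSM nodes tagged key=value near a point."""
    return (
        "[out:json];"
        f'node["{key}"="{value}"](around:{radius_m},{lat},{lon});'
        "out;"
    )

def count_nearby(lat, lon, radius_m=500, key="highway", value="bus_stop"):
    """Run the query against the public Overpass endpoint (network call)."""
    import requests  # installed in the pip cell above
    resp = requests.get(
        "https://overpass-api.de/api/interpreter",
        params={"data": build_overpass_query(lat, lon, radius_m, key, value)},
    )
    resp.raise_for_status()
    return len(resp.json().get("elements", []))
```

Applied row-wise over the latitude/longitude columns, this would add a "bus stops within 500 m" column; for 12,500 listings the queries should be cached or batched to respect the API's rate limits.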
Number of variables: 15
Number of observations: 12500
Data profiling is done with the pandas-profiling package, which automatically generates a profile report from a pandas DataFrame. The built-in df.describe() is useful but too basic for serious exploratory data analysis; pandas-profiling extends the DataFrame with df.profile_report() and performs the exploratory analysis for us, quickly and efficiently. This helps to understand the data very well. The report is also saved in this folder as data_profile.html.
If you need any of the libraries used in this task, remove the leading hashtag and install it directly from this script, for example "!{sys.executable} -m pip install geopandas". Just run the cell and it will install the library for you.
import sys
# !{sys.executable} -m pip install geopandas
# !{sys.executable} -m pip install mplleaflet
# !{sys.executable} -m pip install bs4
# !{sys.executable} -m pip install geocoder
# !{sys.executable} -m pip install geopy
# !{sys.executable} -m pip install folium
# !{sys.executable} -m pip install lxml
# !{sys.executable} -m pip install pygeoj
# !{sys.executable} -m pip install pyshp
# !{sys.executable} -m pip install datetime
# !{sys.executable} -m pip install seaborn
# !{sys.executable} -m pip install --upgrade cython
# required package for timeseries
# !{sys.executable} -m pip install statsmodels
# # required package to run geopandas
#!conda install -c conda-forge libspatialindex -y
# # required packages for neighbourhood analysis
# !{sys.executable} -m pip install geopandas
# !{sys.executable} -m pip install descartes
# !{sys.executable} -m pip install requests
# # required packages for accessibility analysis
# # Make sure Cython is upgraded FIRST!
# !{sys.executable} -m pip install pandana
# required packages for modelling
#!{sys.executable} -m pip install xgboost
%matplotlib inline
# general
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Geographical analysis
import geopandas as gpf # libspatialindex needs to be installed first
import json # library to handle JSON files
from pandas import json_normalize # transform JSON into a pandas DataFrame (the pandas.io.json path is deprecated)
import geopandas as gpd
import shapefile as shp
import datetime
from statsmodels.tsa.seasonal import seasonal_decompose
import requests
import descartes
import rtree
# accessibility analysis
import time
# from pandana.loaders import osm
# from pandana.loaders import pandash5
# modelling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
from sklearn import metrics
import xgboost as xgb
from xgboost import plot_importance
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score
#Hide warnings
import warnings
warnings.filterwarnings('ignore')
# Set plot preference
plt.style.use(style='ggplot')
plt.rcParams['figure.figsize'] = (10, 6)
print('Libraries imported.')
from pathlib import Path
# Installed packages
from ipywidgets import widgets
# Our package
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file
# import plotly.plotly as py
# import plotly.graph_objs as go
# from plotly.offline import iplot, init_notebook_mode
from chart_studio.plotly import plot, iplot
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
# Load the dataset
df = pd.read_csv('task_data_final.csv')
print(f"The dataset contains {len(df)} Condo Rent listings")
pd.set_option('display.max_columns', len(df.columns)) # To view all columns
pd.set_option('display.max_rows', 100)
df.head(10)
df.count()
df.head()
df.tail()
profile = ProfileReport(df, title="Condo Rent Prices", html={'style': {'full_width': True}}, sort="None")
# Or use the HTML report in an iframe
profile.to_notebook_iframe()
profile.to_file(output_file="data_profile.html")
df.describe()
From the summary above we can notice one thing about the rooms column: there is no 1-room apartment available for rent; the minimum number of rooms is 1.5 and the maximum is 9.
The minimum cold rent is 159.60 and the maximum is 4788.00. The minimum rent per square meter is 6.00 and the maximum is 35.49.
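Given the wide spread between the minimum and maximum rent per square meter, it may be worth checking how many listings fall outside the usual range. The sketch below is an illustrative check, not part of the original analysis; it uses the conventional Tukey 1.5×IQR fences, and the toy series stands in for df['rent_per_square_meter'].

```python
import pandas as pd

def iqr_outlier_bounds(series: pd.Series, k: float = 1.5):
    """Return the lower/upper Tukey fences (q1 - k*IQR, q3 + k*IQR)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy example; on the real data use df['rent_per_square_meter']
rents = pd.Series([6.0, 9.5, 10.0, 11.2, 12.0, 35.49])
low, high = iqr_outlier_bounds(rents)
outliers = rents[(rents < low) | (rents > high)]
```

Rows flagged this way are not necessarily errors, but inspecting them before modelling helps decide whether extreme rents should be kept, capped, or dropped.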
# Checking the null values in a dataset
df.isnull().any()
# Sum of null values
df.isnull().sum()
There are many missing values in the dataset, especially in number_of_bedrooms (7,722 missing). It would be better to get rid of the columns with so many missing values.
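Instead of listing the sparse columns by hand, a threshold rule makes this decision reproducible. The sketch below is illustrative (the 50% cutoff is an arbitrary choice, not dictated by the task): it drops any column whose share of missing values exceeds the cutoff.

```python
import pandas as pd
import numpy as np

def drop_sparse_columns(df: pd.DataFrame, max_missing_frac: float = 0.5) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds max_missing_frac."""
    missing_frac = df.isna().mean()  # per-column share of NaN values
    keep = missing_frac[missing_frac <= max_missing_frac].index
    return df[keep]

# Toy example: column 'b' is 75% missing and gets dropped
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [np.nan, np.nan, np.nan, 4]})
cleaned = drop_sparse_columns(toy)
```

With 7,722 of 12,500 values missing, number_of_bedrooms is about 62% empty and would be dropped by this rule as well.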
def percent_value_counts(df, feature):
    """Take a dataframe and a column and return the value counts together with their percentages."""
    percent = pd.DataFrame(round(df.loc[:, feature].value_counts(dropna=False, normalize=True) * 100, 2))
    # creating a df with the absolute counts per value
    total = pd.DataFrame(df.loc[:, feature].value_counts(dropna=False))
    # concatenating the total and percent dataframes
    total.columns = ["Total"]
    percent.columns = ["Percent"]
    return pd.concat([total, percent], axis=1)
percent_value_counts(df, 'number_of_bedrooms')
percent_value_counts(df, 'construction_year')
We can see the total number of missing values and percentage of missing values for a particular attribute.
# Plotting the distribution of numerical and boolean categories
df.hist(figsize=(20,20));
From the plots above, it can be seen that several columns contain only a few categories and can be dropped, such as number_of_bedrooms, construction_year, and heating_type. We can also remove the Lambert position columns, which we don't need. Later we can see what else can be removed.
df.drop(['number_of_bedrooms', 'construction_year', 'heating_type', 'lambert_poistion_x', 'lambert_poistion_y'], axis=1, inplace=True)
df.isna().sum()
We still have some missing values. We need the locations of the apartments, and 80 of them are missing, so we can drop those rows.
df = df.dropna()
df.isna().sum()
df.count()
Now the dataset is clean and ready for analysis and prediction.
Let's start with a bar graph of cold rent by number of rooms. It seems obvious that the rent should increase with the number of rooms; let's check that.
rent_pivot = df.pivot_table(index="rooms",values="cold_rent")
rent_pivot.plot.bar()
plt.show()
In the bar graph above, we can see that the cold rent increases with the number of rooms. But in the case of 3.7 rooms (it is unclear whether 2.7 or 3.7 rooms really exist) the rent is low compared to the others. Let's check the rent per square meter by rooms. Is there any difference?
rent_pivot = df.pivot_table(index="rooms",values="rent_per_square_meter")
rent_pivot.plot.bar()
plt.show()
In the plot above, we can again see that 3.7 rooms have a lower rent compared to the others. We could replace 2.7 and 3.7 with 2.5 and 3.5, but we will keep them for now. The rent per square meter bar graph looks reasonable: the price per square meter for 2.0 and 2.5 rooms is lower than for 1.5 rooms.
I am curious about the 2.7 and 3.7 rooms. Let's check how many of them are in the dataset.
percent_value_counts(df, 'rooms')
Now we can see that there are 7 listings (0.06%) with 2.7 rooms and 2 listings (0.02%) with 3.7 rooms in the dataset. This is most likely a typing mistake, so we can simply replace 2.7 with 2.5 and 3.7 with 3.5 as round numbers.
# replacing the 2.7 and 3.7 rooms with 2.5 and 3.5
df.rooms = df.rooms.astype(str).str.replace('3.7', '3.5', regex=False)  # regex=False so the dot is treated literally
df.rooms = df.rooms.astype(str).str.replace('2.7', '2.5', regex=False)
df.rooms = df.rooms.astype(float)  # convert back to numeric after the string replacements
# Checking the replacement
percent_value_counts(df, 'rooms')
# Checking the rent per square meter in the city
pd.set_option('display.max_rows', 500)
df.groupby('city').max()['rent_per_square_meter']
There are many spelling mistakes and inconsistencies in the city column; the names are not clean or consistently formatted. This could give us wrong values and is not good for the prediction model either. In some places it is written Hamnurg or Hammburg, and the same district appears both as Altona-Altstadt and Hamburg Altona-Altstadt. We need to clean this column.
df = df[df.city != 'H']
df = df[df.city != 'hgttfrf']
df = df[df.city != 'osdorf']
df.city= df.city.str.replace('Hamburg/','Hamburg ',regex=True)
df.city= df.city.str.replace('Hamburg /','Hamburg ',regex=True)
df.city= df.city.str.replace('Hamburg-','Hamburg ' ,regex=True)
df.city= df.city.str.replace('Hamburg -','Hamburg',regex=True)
df.city= df.city.str.replace('Hamburg,','Hamburg ',regex=True)
df.city= df.city.str.replace('HAMBURG -','Hamburg',regex=True)
df.city= df.city.str.replace('Hambrug','Hamburg ',regex=True)
df.city= df.city.str.replace('Hambur','Hamburg ',regex=True)
df.city= df.city.str.replace('Ha,mburg','Hamburg ',regex=True)
df.city= df.city.str.replace('Hamburghttps://www.immobilienscout24.de/scoutmanag','Hamburg',regex=True)
df.city= df.city.str.replace('Hamburg https://www.immobilienscout24.de/scoutmanag','Hamburg',regex=True)
df.city= df.city.str.replace('Hamburg g','Hamburg ',regex=True)
df.city= df.city.str.replace('Hamburg t','Hamburg',regex=True)
df.city= df.city.str.replace('Hamburg ug','Hamburg',regex=True)
df.city= df.city.str.replace('Hanburg','Hamburg',regex=True)
df.city= df.city.str.replace('Harburg','Hamburg',regex=True)
df.city= df.city.str.replace('HAMBURG','Hamburg',regex=True)
df.city= df.city.str.replace('Hmaburg','Hamburg',regex=True)
df.city= df.city.str.replace('20535 hamburg','Hamburg',regex=True)
df.city= df.city.str.replace('Mitte - Hamburg Billstedt','Hamburg Billstedt',regex=True)
df.city= df.city.str.replace('22455','Hamburg',regex=True)
df.city= df.city.str.replace('Hamburg ','Hamburg',regex=True)
df.city= df.city.str.replace('-Bramfeld','',regex=True)
df.city= df.city.str.replace('Hamnurg ','Hamburg',regex=True)
df.city= df.city.str.replace('Hamnurg','Hamburg',regex=True)
df.city= df.city.str.replace('Hammburg','Hamburg',regex=True)
df.city= df.city.str.replace('Hamrburg','Hamburg',regex=True)
df.city= df.city.str.replace('hamburg-','Hamburg ',regex=True)
df.city= df.city.str.replace('hamburg','Hamburg',regex=True)
df.city= df.city.str.replace('Hamburg Hamburg','Hamburg',regex=True)
df.city= df.city.str.replace('Hamburg HafenCity, Osakaallee 6','Hamburg',regex=True)
#df.city= df.city.str.replace('Altona-Nord','Hamburg Altona-Nord',regex=True)
df.city= df.city.str.replace('Altona-Altstadt','Hamburg Altona-Altstadt',regex=True)
df.city= df.city.str.replace('Hamburg/ Wilstorf','Hamburg Wilstorf',regex=True)
df.city= df.city.str.replace('Bramfeld, Hamburg','Hamburg',regex=True)
df.city = df.city.str.replace(r'\s+', ' ', regex=True).str.strip()  # collapse repeated spaces left over from the replacements and trim the ends
# lets check it again
pd.set_option('display.max_rows', 500)
df.groupby(('city'), sort=False)['rent_per_square_meter'].max()
Now it looks much better than before. We removed the stray spaces, hyphens and commas and standardised the names into a consistent format. Hamburg shows a top rent of 35.49 per square meter, and a few districts have maximum rents in the 30s. Let's see the same in a plot, which will be much clearer and more visible than this listing.
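The long chain of str.replace calls above works, but it is order-sensitive and hard to maintain. A more compact alternative sketch (the mapping entries below are an illustrative subset, not the full list of corrections) normalises separators and whitespace first and then applies an explicit typo map:

```python
import pandas as pd

# Illustrative subset of the misspellings corrected above
TYPO_MAP = {
    "Hamnurg": "Hamburg",
    "Hammburg": "Hamburg",
    "Hmaburg": "Hamburg",
    "Hambrug": "Hamburg",
}

def normalise_city(name: str) -> str:
    """Collapse separators/whitespace and fix known misspellings of 'Hamburg'."""
    name = name.replace("/", " ").replace(",", " ").strip()
    name = " ".join(name.split())  # collapse repeated spaces
    parts = [TYPO_MAP.get(p, p) for p in name.split(" ")]
    return " ".join(parts)

cities = pd.Series(["Hamnurg", "Hamburg/ Wilstorf", "  Hammburg  Altona "])
cleaned = cities.map(normalise_city)
```

Keeping the corrections in one dictionary makes it easy to review and extend them as new spelling variants turn up in the data.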
Now we will use plotly to build some interactive charts. We use plotly because there are many categories and values: it lets us zoom in and out, hover over a graph to check the exact values, names and numbers, and select or deselect attributes to compare them.
df.pivot(columns='city', values='rent_per_square_meter').iplot(
    kind='box',
    yTitle='Rent Per Square Meter',
    title='City and Rent')